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^ ■ 7 Abstract 

P^ ■ 8 The study of functional genomics-particularly in non-model organisms has been dramatically 

9 improved over the last few years by use of transcriptomes and RNAseq. While these studies are 

10 potentially extremely powerful, a computationally intensive procedure~the de novo construction 
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of a reference transcriptome must be completed as a prerequisite to further analyses. The 

12 accurate reference is critically important as all downstream steps, including estimating transcript 

.2 i 13 abundance are critically dependent on the construction of an accurate reference. Though a 

T^ ' 14 substantial amount of research has been done on assembly, only recently have the pre-assembly 

qh. 15 procedures been studied in detail. Specifically, several stand-alone error correction modules have 

16 been reported on, and while they have shown to be effective in reducing errors at the level of 

^SJ I 17 sequencing reads, how error correction impacts assembly accuracy is largely unknown. Here, 

^ ' 18 we show via use of a simulated dataset, that applying error correction to sequencing reads has 

19 significant positive effects on assembly accuracy, by reducing assembly error by nearly 50%, and 

00 ' 20 therefore should be applied to all datasets. 

p' 

O ■ 21 Author Summary 

en 

Understanding how genomic processes underlie functional biology is a major focus of evolution- 
ary studies. This general area of research has been greatly facilitated by novel high-throughput 

k> ' 24 sequencing techniques, which are now being applied to ecologically interesting non-model organ- 

Vh ■ 25 isms. These studies are empowered by the assembly of a reference transcriptome- a collection 

- - -' 26 of DNA sequence data corresponding to all genes expressed within a given tissue. The accurate 

27 reconstruction of the genetic info is critical, yet current techniques may inadequately handle 

28 erroneous data. Although several general methods for error correction of sequence data ex- 

29 ist, understanding their effects on assembly. Here, we show that error correction dramatically 

30 reduces error in the reference assembly, and therefore should become a part of the standard 

31 assembly workflow. 



32 Introduction 

33 The popularity of genome-enabled biology has increased dramatically, particularly for researchers 

34 studying non-model organisms, over the last few years. For many, the primary goal of these 

35 works is to better understand the genomic underpinnings of adaptive [1,2] or functional [3,4] 

36 traits. While extremely promising, the study of functional genomics in non- model organisms 

37 typically requires the generation of a reference transcriptome to which comparisons are made. 

38 Although compared to genome assembly [5,6] — transcriptome assembly is less challenging, sig- 

39 nificant hurdles still exist (see [7-9] for examples of the types of challenges) . 

40 

41 The process of transcriptome assembly is further complicated by the error-prone nature of high- 

42 throughput sequencing reads. With regards to Illumina sequencing, error is distributed non- 
43 randomly over the length of the read, with the rate of error increasing from 5' to 3' end [10]. 

44 These errors are overwhelmingly substitution errors [11], with the global error rate being be- 

45 tween 1% and 3%. While beyond the focus of this paper- the accuracy of de novo transcriptome 

46 assembly- sequencing errors may have important implications for SNP calling, and the estima- 

47 tion of nucleotide polymorphism and transcript abundance. 

48 

49 With regards to assembly, sequencing read error has both technical and 'real-world' importance. 

50 Because most transcriptome assemblers use a de bruijn graph representation of sequence con- 

51 nectedness, sequencing error can dramatically increase the size and complexity of the graph, and 

52 thus increase both RAM requirements and runtime. More importantly, however, are their effects 

53 on assembly accuracy. Before the current work, sequence assemblers were thought to efficiently 

54 handle error given sufficient sequence coverage. While this is largely true, sequence error may 

55 lead to assembly error at the nucleotide level despite high coverage, and therefore should be 

56 corrected, if possible. In addition, there may be technical, biological, or financial reasons why 

57 extremely deep coverage may not be possible, therefore, a more general solution is warranted. 



59 While the vast majority of computational genomics research has focused on either assem- 

60 bly [5,6,12,13] or transcript abundance estimation [9,14-16], up until recently, research re- 
el garding the dynamics of pre-assembly procedures has largely been missing. However, recently 

62 error correction has become more popular, with several independent software packages becoming 

63 available for error correction- e.g. AllPathsLG error correction [17], Quake [18], Echo [19], 

64 Reptile [20], SOAF denovo [21] and Seecer [22]. While these packages have largely focused 

65 on the error correction of genomic reads (with exception to Seecer, which was designed for 

66 RNAeq reads), they may likely be used as effectively for RNAseq reads. 

67 

68 Recently a review [11] evaluating several of these methods in their ability to correct genomic 

69 sequence read error was published. However, the application of these techniques to RNAseq 

70 reads, as well as an understanding of how error correction influences accuracy of the de novo 

71 transcriptome assembly has not been evaluated. Here we aim to evaluate several of the available 

72 error correction methods, focusing on their effects on transcriptome assembly accuracy. To ac- 

73 complish this, we simulated 50M single-end Illumina reads from the coding regions of the Homo 

74 sapiens genome version hgl9, assembled uncorrected reads, as well as reads corrected by each 

75 of the evaluated correction methods. We evaluate assembly content, and number of erroneous 

76 bases incorporated into the assembly, and mapping efficiency in an attempt to understand the 

77 effects of error correction on assembly. 

78 

79 Because the de novo assembly is a key resource for all subsequent studies of gene expression and 

BO allelic variation, the production of an error-free reference is absolutely critical. Indeed, error in 

81 the reference itself will have potential impacts on the results of downstream analyses. These 

82 types of error may be particularly problematic in de novo assemblies of non-model organisms, 

83 where experimental validation of sequence accuracy may be impossible. Though methods for 

84 the correction of sequencing reads have been available for the last few years, their adoption has 

85 been limited, seemingly because a demonstration of their effects has been lacking. Here, we show 



86 that error correction has a large relative effect on assembly quality, and therefore argue that it 

87 should become a routine part of workflow involved in processing Illumina mRNA sequence data. 



89 Results/Discussion 

90 Fifty million lOOnt single-end (SE) reads were simulated using the program Flux SlMULA- 

91 TOR [23]. Simulated reads were based on the coding portion of the Homo sapiens genome 

92 version hgl9 and included coverage of about 18k transcripts with average depth of 60X. These 

93 reads were qualitatively similar to several published datasets [24,25]. Sequence error follows the 

94 well-characterized Illumina error profile, and is visualized in Figure lA. Similarly, patterns of 

95 gene expressions were typical of many mammalian tissues, and follows a Poisson distribution 

96 with lambda=l [26-28] Figure IB. All downstream analyses, including assemblies and error- 

97 corrected datasets are based on this reference, which will be made publicly available, pending 

98 acceptance. 

99 

100 Error correction of the raw read dataset was completed using the Seecer [22], AllPathsLG, 

101 Echo, and Reptile error correction modules. Details regarding the specific numbers of nu- 

102 cleotide changes and the proportion of reads being affected are detailed in (Tabic 1). Despite 

103 the fact that each software package attempted to solve the same basic problem, runtime con- 

104 siderations and results were quite different. Trinity assembly using the raw simulated reads 

105 produced an assembly consisting of 27,680,000 nucleotides with approximately 199,000 nucleotide 

106 mismatches (Figure 2A), corresponding to an error rate of 0.072%. While the rate of error is 

107 low, and indeed a testament to the general utility the de bruijn graph approach for sequence 

108 assembly, a dramatic improvement in accuracy would be worth pursuing, if possible. 

109 

no Error correction using Reptile proved to be a laborious process, with multiple (>15) individual 

111 executions of the program required for parameter optimization. While each individual run was 



112 relatively quick, the total time exceed 24 hours, with manual intervention and decision making 

113 required at each execution. Error correction resulted in the correction of 7.7M nucleotides (of 

114 a total ~ 5B nucleotides contained in the sequencing read dataset) , but with only a resultant 

115 marginal improvement in assembly accuracy, with a final error rate of 0.060% (Figure 2B). 

116 

117 The next software package that was considered was Echo. Echo works most efficiently when 

118 coverage is between 20X-30X — initial attempts at running the software on the entire dataset 

119 were unsuccessful, presumably because coverage exceeded the specified limits. To run Echo 

120 given these constraints, the simulated sequence read dataset was split into two equal parts, 

121 corrected, then concatenated. Echo does not provide the user any output so as to assess the 

122 number of corrections made nor the number of reads affected, so the specific number of correc- 

123 tions is not available. Though Echo optimizes parameter selection in an automated fashion, 

124 runtime is extremely long. For this dataset, over 7 days of CPU time was required. Despite 

125 this. Echo resulted in an improvement in assembly accuracy to approximately 0.058%. 

126 

127 AllPathsLG error correction software implemented by far the most aggressive correction, 

128 selected optimized parameters in an automated fashion, and did so within a 4 hour runtime. 

129 AllPathsLG corrected over lllM nucleotides (again, out of a total -^ 5B nucleotides contained 

130 in the sequencing reads), which resulted in a final assembly with 105,000 nucleotide errors, cor- 

131 responding to an error rate of approximately 0.038%. This reduction in error rate is highly 

132 significant, and is 47% less than in the assembly of non-error corrected reads. 

133 

134 Seecer, the only dedicated error-correction software package dedicated to RNAseq reads, per- 

135 formed well. Though like Echo, there is no dialogue allowing the used to understand the number 

136 of nucleotide changes made (though the number of reads aff'ected is known), the end result is 

137 promising, with a significant reduction in error- 110,000 nucleotide errors, corresponding to an 

138 error rate of 0.04%. This reduction was achieved using automatic parameter selection, and ap- 



139 proximately 5 hour runtime. 

140 

141 Assembly content, aside from fine-scaled differences at the nucleotide level, as described above, 

142 were equivalent. Assemblies consisted of between 24458-24673 putative transcripts greater than 

143 200nt in length. N50 ranged from 4524-4562nt. The proportion of reads mapping to each 

144 assembled dataset was equivalent as well, ranging from 91.63% using raw reads to 94.21% in 

145 AllPathsLG corrected reads. Lastly, we did not observe an obvious relationship between gene 

146 expression and the quality of the assembled transcripts (Figure 3), including those with signifi- 

147 cant splicing, nor in differences in contiguity of the assembled transcripts between the different 

148 correction methods (Figure 4). Together this suggests that error correction does not have a 

149 significant effect on the structure of assembly- instead, it's effects are limited to enhancing 

150 resolution at the level of the nucleotide. While we did not find, nor expect large differences 

151 in these global metrics, we do expect to see a significant effect on transcriptome based studies 

152 of marker development and population genetics, which are endeavors fundamentally linked to 

153 polymorphism, estimates of which can easily be confused by sequence error. 



155 Methods 

156 Because we were interested in understanding the effects of error correction on the assembly of ver- 

157 tebrate transcriptome assembly, we elected to use coding sequences greater than 200nt in length 

158 from the Homo sapiens reference genome (hgl9), available at http : //hgdownload . cse . ucsc . edu/goldenPath/ 

159 Fifty million lOOnt single-end Illumina reads were simulated with the program Flux Simula- 

160 TOR [23] which attempts to simulate a realistic Illumina RNAseq dataset, incorporating biases 

161 related to library construction and sequencing. Sequencing error increased along the length of 

162 the read, as per program default. Patterns of gene expression were modeled to follow patterns 

163 typically seen in studies of Eukaryotic gene expression. The Flux Simulator requires the use 

164 of a parameter file, which is included in Supporting Information SO. Additionally, the current 
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165 version of the program (1.2) includes a bug where sequence length does not equal the length of 

166 the quality score for a small fraction of simulate reads. Code to remove such reads and their 

167 associated quality scores is included in SI text SI. 

168 

169 Quality metrics for the raw reads were generated using the program SolexaQA [29], and vi- 

170 sualized using R [30]. Patterns of gene expression were validated using the software packages 

171 Bowtie2 [31] and eXpress [32]. Ah computational work was performed on a 16-core 36GB 

172 RAM Linux Ubuntu workstation. 

173 

174 Error correction was performed using four different error correction software packages. These 

175 included Seecer, AllPathsLG error correction, Reptile, and Echo. Each of these programs 

176 implements a different computational strategy for error correction, and therefore their success, 

177 and ultimate effects on assembly accuracy are expected to vary. Additionally, in a recent review 

178 of error correction methods 2 of these 3 (ECHO and Reptile) were shown to be amongst the 

179 most accurate [11]. 

180 

181 Though error correction has been a part of the AllPathsLG genome assembler for the past sev- 

182 eral versions, only recently has a stand-along version of their python-based error correction mod- 

133 ule (http://www.broadinstitute.org/software/allpaths-lg/blog/?p=577) become avail- 

134 able, which leverages several of the AllPaths subroutines, become available. With exception to 
185 the minimum kmer frequency, which was set to 1 (unique kmers discarded in the final corrected 

136 datset), the AllPathsLG error correction software was run using default settings for correcting 

137 errors contained within the raw sequencing reads. Code for running the program is included in 
133 SI text S2. 

189 

190 Error correction using the software package Reptile requires the optimization of several pa- 

191 rameters via an included set of scripts, and therefore several runs of the program. To correct 



192 errors contained within the raw dataset, I set knier size to 25 {KmerLen=25) , and the maximum 

193 error rate to 2% {MaxErrRare=0.02) . Kmer=25 was selected to most closely match the kmer 

194 size used by the assembler Trinity. I empirically determined optimal values for T_expGoodCnt 

195 and T_card using multiple independent program executions. Reptile requires the use of a pa- 

196 rameter file, which is included in SI text S3. 

197 

198 The software package Echo was used to error correct the raw read dataset. The software pack- 

199 age, developed be a group of Computer Scientists at the University of California, Berkeley, is 

200 optimized for use on genome sequencing datasets whose coverage is between 20X and SOX. We 

201 ran ECHO using default settings appropriate for a dataset of this size, but in an attempt at skirt- 

202 ing issues revolving around limited RAM availability, I split the dataset into two equal parts, 

203 performed ECHO error correction, then concatenated the corrected dataset. For this reason, this 

204 dataset may be less-than optimally corrected. Code for running Echo is included in SI text S3. 

205 

206 Lastly, the software package Seecer was used to error correct the raw read dataset. The 

207 software package is fundamentally different than the other packages, in that it was designed for 

208 with RNAseq reads in mind. We ran Seecer using default settings. 

209 Transcriptome assemblies were generated using the default settings of the program Trinity [33] . 

210 Code for running Trinity is included in SI text S5. Assemblies were evaluated using a vari- 

211 ety of different metrics. First, Blast+ [34] was used to match assembled transcripts to their 

212 reference. TransDecoder (http://transdecoder.sourceforge.net/) was used to identify 

213 full-length transcripts. The program Blat [35] was used to identify and count nucleotide mis- 

214 matches between reconstructed transcripts and their corresponding reference. Differences were 

215 visualized using the program R. 



217 Conclusions 

218 To evaluate the effects of correction of sequencing error on assembly accuracy, we generated 

219 a simulated Illumina dataset, which consisted of 50M single-end reads. We attempted error 

220 correction using three popular error correction software packages, and evaluated their effect on 

221 assembly. Though originally developed with genome sequencing in mind, we found that all 

222 tested methods do correct mRNAseq reads, and increase assembly accuracy, though AllPath- 

223 sLG appeared to have the most favorable effect. This study demonstrates the utility of error 

224 correction, and proposes that it become a routine step in the processing of Illumina sequence 

225 data. 
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317 Figure 1 
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Fig. 1. A. The probability of error increases from the 5' to 3' end of the sequencing read. 
This error pattern is typical of Illumina sequencing. B. The log transformed distribution of 
gene expression (FPKM) is typical of many well characterized Eukaryotic tissue types, where 
the vast majority of genes are expressed moderately, with much fewer being expressed at a very 
high level. 
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Figure 2A 
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Figure 2B 
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Fig. 2. Error correction improves assembly accuracy. A. Error correction decreases the number 
of nucleotide mismatches between assembled contig and reference. Error corrected assemblies 
have more perfect matches. B. The global estimate of nucleotide mismatch decreases with error 
correction. The assembly done with AllPathsLG corrected reads has nearly 50% less mismatch 
than does the raw read assembly. 
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335 Fig. 3 The number of nucleotide mismatches in a given contig is not related to gene expression. 

336 On average, poorly expressed transcripts are no more error prone than are highly expressed 

337 transcripts. 
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Figure 4 
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340 Fig. 4. Assembly contiguity did not vary significantly between assemblies of reads using the 

341 different error correction methods. Each error correction methods, as well as assembly of raw 

342 reads, produced an assembly that is dominated by full length (both start and stop codon present) 

343 or nearly full length assembled transcripts. 
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344 Table 1 

345 



Dataset 


Num Reads 


Num reads corr 


Num nt corr 


Runtime 


Raw reads 


50M 


n/a 


n/a 


n/a 


AllPathsLG Corr. 


50M 


>4,033,291 


111,851,115 


~ 8hrs 


Echo Corr. 


50M 


? 


? 


~ 7 days 


Reptile Corr. 


50M 


2,047,088 


7,782,594 


~ 2 hours 


Seecer Corr. 


50M 


7,845,281 


? 


~ 5 hours 



347 Table 1. Number of raw sequencing reads, sequencing reads corrected, nucleotides (nt) corrected, 

348 and approximate runtime for each of the datasets. Note that neither Echo nor Seecer provides 

349 information regarding the number of corrections applied. 

350 Supporting Information 



351 SO: .par file needed to run Flux Simulator 



352 

353 
354 
355 
356 
357 
358 
359 
360 
361 
362 
363 
364 
365 
366 
367 
368 
369 
370 
371 
372 
373 
374 
375 
376 



REF_FILE_NAME hgl9_Ref Seq_fromUCSC 1006 15_ sort ed_nuc . gtf 
GEN_DIR genomes /homo/ 

################################################### Expression 

NB_MOLECULES 6000000 
TSS_MEAN 50 
PDLYA_SCALE 100 
PDLYA_SHAPE 1.5 

############################################### Fragmentation 

FRAG_SUBSTRATE RNA 
FRAG_METHOD UR 
FRAG_UR_ETA 350 
FRAG_UR_DO 100 
RTRANSCRIPTION YES 
RT_MOTIF default 
RT_PRIMER RH 

######################################################### PCR 

PCR_DISTRIBUTION default 
GC_MEAN NaN 
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377 PCR_PROBABILITY 0.05 

378 FILTERING YES 

379 SIZE_DISTRIBUTION N(400,100) 

380 

381 ################################################# Sequencing 

382 

383 READ_NUMBER 50000000 

384 READ_LENGTH 100 

385 PAIRED_END NO 

386 FASTA YES 

387 GEN_DIR genomes/ 

388 ERR_FILE 76 



389 SI: Code to remove reads where sequence length does not equal quality length 



390 cat test.fastq | awk ' BEGIN{OFS = "\n"} { 

391 a[NR 7, 4] = $0; 

392 if(NR 7, 4 == && length (a [2] ) == length (a [0] )) { 
print a[l] , a [2] , a [3] , a [0] 
> 
} ' > sim . f astq 



393 
394 
395 



396 S2: Code for AllPathsLG error correction 

397 ~/ErrorCorrectReads . pi \ 

398 MAX_MEMORY_GB= 30 THREADS= 8 PHRED_ENCODING= 33 PAIRED_READS_A_IN= \ 

399 PAIRED_READS_B_IN= UNPAIRED_READS_IN= sim. f astq \ 

400 FE_MAX_KMER_FREQ_TO_MARK= 1 EC_K= 24 HAPLOIDIFY= True FILL_FRAGMENTS= \ 

401 False FF_K = 28 FE_USE_KMER_SPECTRUM = TRUE READS_OUT= corr 



402 S3: Parameter file for Reptile error correction 



403 InFaFile 


sim . fa 




404 IQFile 


sim . q 




405 OErrFile 


sim . reptile , 


. err 


406 

407 QFlag 


1 




408 BatchSize 


3000000 




409 KmerLen 


13 




410 

411 hd_max 


1 




412 Step 


11 





19 



413 


ExpectSearch 


4 


414 


T_ratio 


0.5 


415 






416 


QThreshold 


73 


417 


MaxBadQPerKmer 


8 


418 


Qlb 


65 


419 


T_expGoodCnt 


56 


420 


T_card 


16 



421 S4: Code for Echo error correction 

422 python "/ echo_vl_ 12/ErrorCorrect ion . py -b 2000000 \ 

423 --nh 2048 --ncpu 8 \ 

424 -o echo . ec . sim . 1 . f astq parti, fastq 

425 python ~/echo_vl_12/ErrorCorrection.py -b 2000000 \ 

426 --nh 2048 --ncpu 8 \ 

427 -o echo . ec . sim . 2 . fastq part2. fastq 

428 cat parti, fastq part2. fastq > echo . corr . fastq 



429 S5 Code for Trinity assembly 



430 ~/trinityrnaseq_r2013-02-25/Trinity .pi --seqType fq --JM 30G \ 

431 --single *. fastq --f ull_cleanup --CPU 8 



